I. Load libraries and the data

data = read_delim('bank.csv', delim = ';')
## 
## -- Column specification --------------------------------------------------------
## cols(
##   age = col_double(),
##   job = col_character(),
##   marital = col_character(),
##   education = col_character(),
##   default = col_character(),
##   balance = col_double(),
##   housing = col_character(),
##   loan = col_character(),
##   contact = col_character(),
##   day = col_double(),
##   month = col_character(),
##   duration = col_double(),
##   campaign = col_double(),
##   pdays = col_double(),
##   previous = col_double(),
##   poutcome = col_character(),
##   y = col_character()
## )
data %>%
        head() %>%
        kable()
age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome y
30 unemployed married primary no 1787 no no cellular 19 oct 79 1 -1 0 unknown no
33 services married secondary no 4789 yes yes cellular 11 may 220 1 339 4 failure no
35 management single tertiary no 1350 yes no cellular 16 apr 185 1 330 1 failure no
30 management married tertiary no 1476 yes yes unknown 3 jun 199 4 -1 0 unknown no
59 blue-collar married secondary no 0 yes no unknown 5 may 226 1 -1 0 unknown no
35 management single tertiary no 747 no no cellular 23 feb 141 2 176 3 failure no

II. Exploratory Data Analysis

Check data type and content

df_status(data)
##     variable q_zeros p_zeros q_na p_na q_inf p_inf      type unique
## 1        age       0    0.00    0    0     0     0   numeric     67
## 2        job       0    0.00    0    0     0     0 character     12
## 3    marital       0    0.00    0    0     0     0 character      3
## 4  education       0    0.00    0    0     0     0 character      4
## 5    default       0    0.00    0    0     0     0 character      2
## 6    balance     357    7.90    0    0     0     0   numeric   2353
## 7    housing       0    0.00    0    0     0     0 character      2
## 8       loan       0    0.00    0    0     0     0 character      2
## 9    contact       0    0.00    0    0     0     0 character      3
## 10       day       0    0.00    0    0     0     0   numeric     31
## 11     month       0    0.00    0    0     0     0 character     12
## 12  duration       0    0.00    0    0     0     0   numeric    875
## 13  campaign       0    0.00    0    0     0     0   numeric     32
## 14     pdays       0    0.00    0    0     0     0   numeric    292
## 15  previous    3705   81.95    0    0     0     0   numeric     24
## 16  poutcome       0    0.00    0    0     0     0 character      4
## 17         y       0    0.00    0    0     0     0 character      2

There are no missing values in any column.

Check the response variable

table(data$y)/nrow(data)
## 
##      no     yes 
## 0.88476 0.11524

There are far fewer yes responses for the term deposit, so the classes are imbalanced. A stratified sampling method is needed when splitting into train and test sets.
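As a quick illustration of why stratification matters (toy labels only, mimicking the ~88/12 imbalance; caret's createDataPartition used later does this automatically):

```r
# Toy illustration of a stratified split in base R; y_toy mimics the class imbalance.
set.seed(1)
y_toy = factor(rep(c('no', 'yes'), times = c(885, 115)))
# sample 80% within each class separately, so both classes keep their share
idx = unlist(lapply(split(seq_along(y_toy), y_toy),
                    function(i) sample(i, size = floor(0.8 * length(i)))))
prop.table(table(y_toy[idx]))  # 0.885 no / 0.115 yes, same as the full data
```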

Check numeric data

plot1 = inspect_num(data) %>%
        show_plot()+
        theme_minimal()
ggplotly(plot1)

Most numeric variables are skewed. Based on the attribute information, the columns duration, campaign, pdays, and previous are derived from customer contacts, so it is probably better to leave these columns as is.

Try transforming the balance column, since it is probably log-normally distributed. Note that observations with negative or zero balance are dropped by the log scale.

plot2 = data %>%
        ggplot(aes(balance))+
        geom_histogram(fill = 'steelblue')+
        scale_x_log10(labels = comma)+
        labs(y = '', x = 'Log balance', title = 'Log distribution of balance')+
        theme_minimal()

ggplotly(plot2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Check if there are still outliers after log transformation

# log() of non-positive balances yields NaN (with a warning); boxplot() ignores them
outliers_balance = boxplot(log(data$balance), plot = FALSE)$out

data %>%
        filter(log(balance) %in% outliers_balance, balance > 100) %>%
        kable()
age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome y
60 retired married primary no 71188 no no cellular 6 oct 205 1 -1 0 unknown no
quantile(data$balance)
##    0%   25%   50%   75%  100% 
## -3313    69   444  1480 71188

It would be better to remove this entry, since such an extreme balance might distort the model.

corr = data %>%
        select_if(is.numeric) %>%
        cor()

corrplot(corr, method = 'number')

The fields previous and pdays are highly correlated because both describe prior campaign contacts: whenever previous is 0, pdays is coded as -1, so the two fields are not independent.
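A small simulation (toy numbers, not the bank data) shows how the -1 coding of pdays for never-contacted clients mechanically couples it to previous:

```r
# Toy simulation: clients with previous == 0 get pdays coded as -1, as in this dataset.
set.seed(2)
previous_toy = rpois(1000, 0.4)                        # number of prior contacts
pdays_toy = ifelse(previous_toy == 0, -1,
                   sample(1:400, 1000, replace = TRUE))  # days since last contact
cor(previous_toy, pdays_toy)  # strongly positive, driven by the shared "never contacted" group
```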

Check categorical data

Check frequency of categorical data

inspect_cat(data) %>%
        show_plot()

Plot first three demographic data related to bank client

plot3 = data %>%
        ggplot(aes(x = job, fill =  education))+
        geom_bar(position =  'fill')+
        coord_flip()+
        facet_grid(~marital)+
        labs(y = 'proportions', x= '', title = 'Job by marital status and education')+
        scale_y_continuous(breaks = seq(0, 1, .2), 
                     label = percent)+
        scale_fill_brewer(palette = 'GnBu')+
        theme_minimal()+
        theme(legend.position = 'bottom')

ggplotly(plot3)

Some insights from this graph

  • There are no divorced students in our dataset. All divorced clients with an unknown job have tertiary education. Entrepreneurs and retirees have diverse education and marital status.

  • We can see the distribution of education and marital status per job. Most management jobs have a tertiary level of education, whereas services and admin jobs are predominantly secondary education, regardless of marital status.

Plot the other three bank client variables: housing loan, personal loan, and default

plot4 = data %>%
        ggplot(aes(default, fill = loan))+
        geom_bar(position = 'fill', alpha = .7)+
        scale_y_continuous(breaks = seq(0, 1, .2), 
                     labels = percent) +
        scale_fill_manual(values = c('yes' = 'tomato', 'no' = 'gray'))+
        labs(x = 'Credit card defaulted', fill = 'Personal Loan',
             title = 'Percentage of individuals with personal loan by credit card default')+
        theme_minimal()+
        theme(legend.position = 'top')

plot4

plot5 = data %>%
        ggplot(aes(default, fill = housing))+
        geom_bar(position = 'fill', alpha = .7)+
        scale_y_continuous(breaks = seq(0, 1, .2), 
                     labels = percent) +
        scale_fill_manual(values = c('yes' = 'tomato', 'no' = 'gray'))+
        labs(x = 'Credit card has default', fill = 'Housing loan',
             title = 'Percentage of individuals with housing loan by credit card default')+
        theme_minimal()

plot5

Unlike the first graph, the proportion of individuals with a housing loan is about the same whether or not they have credit in default. This is probably because housing loans are taken out by all kinds of clients. Personal loan would therefore be a better predictor of default than the housing loan field.

Possible questions the bank might be interested in

1. Is the previous marketing campaign a good indicator whether a client will get a term deposit?

plot6 = data %>%
        group_by(poutcome, y) %>%
        count() %>%
        group_by(y) %>%
        mutate(percentage = n/ sum(n)) %>%
        ggplot(aes(reorder(poutcome, percentage), percentage, fill = y))+
        geom_col(position = 'dodge', alpha = .7)+
        scale_fill_manual(values = c('yes' = 'tomato', 'no' = 'gray'))+
        labs(x = 'Previous campaign outcome', y = 'Percentage', fill = 'W/ term deposit',
             title = 'Outcome of previous campaign and term deposit')+
        theme_minimal()+
        coord_flip()

plot6

2. What is the balance distribution of clients with and without term deposit?

plot8 = data %>%
        ggplot(aes(balance,  fill = y, color = y))+
        geom_density(alpha = .3)+
        scale_x_log10(labels = comma)+
        scale_fill_manual(values = c('yes' = 'tomato', 'no' = 'gray'))+
        scale_color_manual(values = c('yes' = 'tomato', 'no' = 'gray'), guide = F)+
        labs(title = 'Balance distribution by obtained term deposit', x = 'Log balance', fill = 'W/ term deposit')+
        theme_minimal()

plot8

The peak for clients with a term deposit is slightly to the right, implying that those who subscribe to a term deposit tend to have a higher balance.

3. What jobs are more likely to get a personal loan?

plot9=  data %>%
        group_by(job, loan) %>%
        count() %>%
        group_by(loan) %>%
        mutate(percentage = n/ sum(n)) %>%
        mutate(highlight = ifelse((job %in% c('blue-collar','admin.','services','entrepreneur')),'yes','no')) %>%
        ggplot(aes(reorder(job, percentage), percentage, fill = loan))+
        geom_col(position = 'dodge', aes(alpha = highlight))+
        scale_alpha_manual(values = c('yes' = 1, 'no' = .4), guide = F)+
        scale_fill_manual(values = c('yes' = 'tomato', 'no' = 'gray'))+
        labs(x = '', y = 'Percentage', fill = 'Has loan',
             title = 'Percentage of clients with personal loan by job')+
        theme(legend.position = 'none')+
        theme_minimal()+
        coord_flip()
        
plot9

Blue-collar, admin, services, and entrepreneur jobs are more likely to hold a personal loan.

4. How does a client’s education affect their balance?

Note: Observations with balance less than 5000 only

plot10 = data %>%
        filter(balance < 5000) %>%
        group_by(education) %>%
        mutate(median_balance = median(balance)) %>%
        ggplot(aes(balance))+
        geom_histogram(binwidth = 250, fill = 'steelblue')+
        geom_vline(aes(xintercept = median_balance), color = 'black', linetype = 'dashed')+
        theme_minimal()+
        facet_grid(education ~ .)+
        labs(title = 'Distribution of balance by education', y = '', x = 'balance', caption = 'Black dashed line is median')

ggplotly(plot10)

All groups have almost the same right-tailed shape, with the tertiary education level having the highest median and maximum. Primary and secondary education levels also show more occurrences of negative balance.

5. Is there a relationship between the client’s age and balance?

Note: balance > 0 only; this removes around 700 observations

plot11 = data %>%
        filter(balance > 0) %>%
        ggplot(aes(age, balance))+
        geom_point(alpha = .7, aes(color = y))+
        scale_y_log10(labels = comma)+
        geom_smooth(color = 'black', size = 1.2, alpha = .8)+
        labs(x = 'age', y = 'Log balance', title = 'Age vs Log balance', color = 'Term deposit')+
        scale_color_manual(values = c('yes' = 'tomato', 'no' = 'gray'))+
        theme_minimal()

ggplotly(plot11)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

There is a slightly increasing trend, more noticeable in the higher age groups. Also, most of the points at the bottom (low balance) did not avail of a term deposit.

6. Does the contact duration affect whether the client will get a term deposit?

plot12 = data %>%
        ggplot(aes(x = as.factor(y), y = duration, fill = y))+
        geom_violin(width = 1.4, alpha = .7)+
        stat_summary(fun = median, geom = 'point', shape = 22 , size = 2)+
        scale_fill_manual(values = c('yes' = 'tomato', 'no' = 'gray'))+
        labs(x = 'W/ term deposit', y=  'contact duration', title = 'Contact duration distribution by term deposit',
             caption = 'Violin plot with median (square)')+
        theme_minimal()+
        theme(legend.position = 'none')
       
ggplotly(plot12)

From the data, those who get a term deposit have a higher contact duration.

III. Modeling

Change data type

Change character into factors

data = data %>%
  mutate_if(is.character, as.factor)

glimpse(data)
## Rows: 4,521
## Columns: 17
## $ age       <dbl> 30, 33, 35, 30, 59, 35, 36, 39, 41, 43, 39, 43, 36, 20, 3...
## $ job       <fct> unemployed, services, management, management, blue-collar...
## $ marital   <fct> married, married, single, married, married, single, marri...
## $ education <fct> primary, secondary, tertiary, tertiary, secondary, tertia...
## $ default   <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, n...
## $ balance   <dbl> 1787, 4789, 1350, 1476, 0, 747, 307, 147, 221, -88, 9374,...
## $ housing   <fct> no, yes, yes, yes, yes, no, yes, yes, yes, yes, yes, yes,...
## $ loan      <fct> no, yes, no, yes, no, no, no, no, no, yes, no, no, no, no...
## $ contact   <fct> cellular, cellular, cellular, unknown, unknown, cellular,...
## $ day       <dbl> 19, 11, 16, 3, 5, 23, 14, 6, 14, 17, 20, 17, 13, 30, 29, ...
## $ month     <fct> oct, may, apr, jun, may, feb, may, may, may, apr, may, ap...
## $ duration  <dbl> 79, 220, 185, 199, 226, 141, 341, 151, 57, 313, 273, 113,...
## $ campaign  <dbl> 1, 1, 1, 4, 1, 2, 1, 2, 2, 1, 1, 2, 2, 1, 1, 2, 5, 1, 1, ...
## $ pdays     <dbl> -1, 339, 330, -1, -1, 176, 330, -1, -1, 147, -1, -1, -1, ...
## $ previous  <dbl> 0, 4, 1, 0, 0, 3, 2, 0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 2, 0, ...
## $ poutcome  <fct> unknown, failure, failure, unknown, unknown, failure, oth...
## $ y         <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, yes, ...

Remove/Add/Change data

Identify variables with near zero variance

nz = nearZeroVar(data)
colnames(data[,nz])
## [1] "default" "pdays"

We can remove the variables default and pdays since near-zero-variance predictors have little predictive power. pdays is also highly correlated with previous.
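For reference, a base-R sketch of caret's frequency-ratio rule (freqCut = 95/5 and uniqueCut = 10 are caret's documented defaults):

```r
# Toy variable as imbalanced as `default`: the most common value dwarfs the second.
x_toy = c(rep('no', 4400), rep('yes', 80))
tab = sort(table(x_toy), decreasing = TRUE)
freq_ratio = as.numeric(tab[1] / tab[2])               # 4400 / 80 = 55
pct_unique = 100 * length(unique(x_toy)) / length(x_toy)
freq_ratio > 95/5 & pct_unique < 10                    # TRUE => flagged as near zero variance
```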

Remove these columns and remove the outlier

data = data %>%
  filter(!balance == max(balance))

data = data[,-nz]

Split train and test set

80% training and 20% test. Stratify the split on the response variable y so that the train and test sets have the same proportion of y.

in_train  = createDataPartition(data$y, p = .8, list = F)
train_data = data[in_train,]
test_data = data[-in_train,]

Create down-sampled and up-sampled versions of the training set to balance the response variable

#834 rows
train_data_down = downSample(x = train_data %>%
                        select(-y),
                   y = train_data$y,
                   yname = 'y')

#6400 rows
train_data_up = upSample(x = train_data %>%
                        select(-y),
                   y = train_data$y,
                   yname = 'y')

table(train_data_down$y)/nrow(train_data_down)
## 
##  no yes 
## 0.5 0.5

Model Building

Model setup

Set up train control for resampling.

  • K-fold cross-validation
  • 10 folds
  • Compute class probabilities
  • multiClassSummary summary function to report AUC in addition to accuracy, sensitivity, and specificity during resampling.

train_control = trainControl(method = 'cv', number = 10, classProbs = T, summaryFunction = multiClassSummary)

Set up parallel computing

library(doParallel)
cores = detectCores()
cl = makeCluster(cores[1]-1)
#Register cluster
registerDoParallel(cl)

Logistic regression model

set.seed(100)
glm_model = train(y ~., data = train_data_down, method = 'glm', family = 'binomial', trControl = train_control)
glm_model
## Generalized Linear Model 
## 
## 834 samples
##  14 predictor
##   2 classes: 'no', 'yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 750, 751, 751, 750, 751, 751, ... 
## Resampling results:
## 
##   logLoss    AUC        prAUC      Accuracy  Kappa      F1         Sensitivity
##   0.4740668  0.8720093  0.8400509  0.798537  0.5969431  0.8041056  0.8247387  
##   Specificity  Pos_Pred_Value  Neg_Pred_Value  Precision  Recall   
##   0.7720674    0.7870392       0.8160755       0.7870392  0.8247387
##   Detection_Rate  Balanced_Accuracy
##   0.4124068       0.798403

Predict using glm model

glm_preds = predict(glm_model, test_data)
confusionMatrix(glm_preds, test_data$y, positive = 'yes')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  664  21
##        yes 135  83
##                                          
##                Accuracy : 0.8272         
##                  95% CI : (0.801, 0.8514)
##     No Information Rate : 0.8848         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.426          
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.79808        
##             Specificity : 0.83104        
##          Pos Pred Value : 0.38073        
##          Neg Pred Value : 0.96934        
##              Prevalence : 0.11517        
##          Detection Rate : 0.09192        
##    Detection Prevalence : 0.24142        
##       Balanced Accuracy : 0.81456        
##                                          
##        'Positive' Class : yes            
## 

This model has high accuracy but a low positive predictive value: there were far more predicted yes than actual yes. In practice this would incur additional cost, since we would pursue clients predicted to take a term deposit who ultimately would not. If there is budget for those extra contacts, this might be acceptable.
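A back-of-the-envelope check of that trade-off, using the confusion-matrix counts above and purely hypothetical cost figures:

```r
# Hypothetical economics: cost_per_contact and value_per_deposit are assumptions, not data.
cost_per_contact = 5      # assumed cost of pursuing one predicted 'yes'
value_per_deposit = 100   # assumed value of one client who actually subscribes
tp = 83; fp = 135         # true/false positives from the confusion matrix above
net = tp * value_per_deposit - (tp + fp) * cost_per_contact
net  # 7210: positive, so the low precision may still pay off at these assumed rates
```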

Train the same model but with the up-sampled training data

set.seed(100)
glm_model_up = train(y ~., data = train_data_up, method = 'glm', family = 'binomial', trControl = train_control)
glm_model_up
## Generalized Linear Model 
## 
## 6400 samples
##   14 predictor
##    2 classes: 'no', 'yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 5760, 5760, 5760, 5760, 5760, 5760, ... 
## Resampling results:
## 
##   logLoss    AUC        prAUC      Accuracy   Kappa      F1       Sensitivity
##   0.4041624  0.9077314  0.8983478  0.8235938  0.6471875  0.82678  0.841875   
##   Specificity  Pos_Pred_Value  Neg_Pred_Value  Precision  Recall  
##   0.8053125    0.8126312       0.8361503       0.8126312  0.841875
##   Detection_Rate  Balanced_Accuracy
##   0.4209375       0.8235938
glm_model_up_pred = predict(glm_model_up, test_data)
confusionMatrix(glm_model_up_pred, test_data$y, positive = 'yes')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  662  24
##        yes 137  80
##                                           
##                Accuracy : 0.8217          
##                  95% CI : (0.7951, 0.8461)
##     No Information Rate : 0.8848          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.4059          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.76923         
##             Specificity : 0.82854         
##          Pos Pred Value : 0.36866         
##          Neg Pred Value : 0.96501         
##              Prevalence : 0.11517         
##          Detection Rate : 0.08859         
##    Detection Prevalence : 0.24031         
##       Balanced Accuracy : 0.79888         
##                                           
##        'Positive' Class : yes             
## 

The up-sampled model has higher cross-validated accuracy, but on the test set it is slightly less accurate than the down-sampled model and still has a low positive predictive value.

One advantage of using a simpler model such as logistic regression is that you can check the estimate and significance of each variable.

summary(glm_model_up)
## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -5.4594  -0.5858  -0.0060   0.5976   2.3582  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -9.871e-02  3.451e-01  -0.286 0.774859    
## age                -1.193e-02  4.492e-03  -2.657 0.007889 ** 
## `jobblue-collar`   -8.191e-01  1.442e-01  -5.681 1.34e-08 ***
## jobentrepreneur    -5.542e-01  2.283e-01  -2.427 0.015214 *  
## jobhousemaid       -3.224e-01  2.535e-01  -1.272 0.203445    
## jobmanagement      -4.005e-01  1.468e-01  -2.728 0.006373 ** 
## jobretired          6.673e-01  1.955e-01   3.413 0.000642 ***
## `jobself-employed` -4.640e-01  2.186e-01  -2.123 0.033782 *  
## jobservices        -6.448e-01  1.671e-01  -3.860 0.000114 ***
## jobstudent          4.766e-01  2.552e-01   1.867 0.061883 .  
## jobtechnician      -4.057e-01  1.360e-01  -2.984 0.002847 ** 
## jobunemployed      -4.347e-01  2.322e-01  -1.872 0.061157 .  
## jobunknown          7.504e-01  3.457e-01   2.171 0.029931 *  
## maritalmarried     -1.252e-01  1.130e-01  -1.108 0.267689    
## maritalsingle      -5.690e-02  1.328e-01  -0.429 0.668284    
## educationsecondary  1.691e-01  1.257e-01   1.345 0.178728    
## educationtertiary   3.987e-01  1.427e-01   2.795 0.005196 ** 
## educationunknown   -4.496e-01  2.288e-01  -1.965 0.049398 *  
## balance             2.468e-06  1.308e-05   0.189 0.850370    
## housingyes         -3.929e-01  8.354e-02  -4.703 2.56e-06 ***
## loanyes            -9.621e-01  1.192e-01  -8.072 6.90e-16 ***
## contacttelephone    1.178e-01  1.457e-01   0.809 0.418560    
## contactunknown     -1.330e+00  1.238e-01 -10.742  < 2e-16 ***
## day                 1.040e-02  4.914e-03   2.117 0.034273 *  
## monthaug           -6.591e-01  1.447e-01  -4.555 5.25e-06 ***
## monthdec            3.029e-01  4.817e-01   0.629 0.529527    
## monthfeb           -3.153e-01  1.848e-01  -1.706 0.087934 .  
## monthjan           -1.780e+00  2.368e-01  -7.517 5.61e-14 ***
## monthjul           -1.149e+00  1.557e-01  -7.378 1.61e-13 ***
## monthjun           -2.121e-01  1.808e-01  -1.173 0.240616    
## monthmar            1.552e+00  2.729e-01   5.686 1.30e-08 ***
## monthmay           -8.038e-01  1.436e-01  -5.597 2.18e-08 ***
## monthnov           -9.065e-01  1.582e-01  -5.730 1.00e-08 ***
## monthoct            1.184e+00  2.390e-01   4.956 7.20e-07 ***
## monthsep            1.067e-01  3.319e-01   0.321 0.747899    
## duration            6.079e-03  1.751e-04  34.720  < 2e-16 ***
## campaign           -1.165e-01  1.813e-02  -6.425 1.32e-10 ***
## previous           -5.836e-02  2.867e-02  -2.036 0.041764 *  
## poutcomeother       1.056e+00  1.705e-01   6.194 5.86e-10 ***
## poutcomesuccess     2.625e+00  2.223e-01  11.809  < 2e-16 ***
## poutcomeunknown    -4.235e-01  1.361e-01  -3.112 0.001856 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8872.3  on 6399  degrees of freedom
## Residual deviance: 5076.6  on 6359  degrees of freedom
## AIC: 5158.6
## 
## Number of Fisher Scoring iterations: 6

A sample of the significant variables and the direction of their effect

  • poutcomesuccess : positive
  • duration : positive
  • unknown contact : negative
  • loanyes : negative
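Because logistic-regression coefficients are log-odds, exponentiating an estimate from the summary above turns it into an odds ratio:

```r
# Reading two of the coefficients above as odds ratios (exp of the log-odds estimates):
exp(2.625)    # poutcomesuccess: a prior campaign success multiplies the odds ~13.8x
exp(-0.9621)  # loanyes: holding a personal loan cuts the odds of a deposit by ~62%
```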

Random forest

Try a more complex model with tuning parameters to increase model accuracy.

set.seed(100)

# set up tuning grid: try mtry values from 2 to 8
tunegrid = expand.grid(.mtry = 2:8)

rf_model = train(y ~., data = train_data_up, method = 'rf', tuneGrid = tunegrid, trControl = train_control)
rf_model
## Random Forest 
## 
## 6400 samples
##   14 predictor
##    2 classes: 'no', 'yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 5760, 5760, 5760, 5760, 5760, 5760, ... 
## Resampling results across tuning parameters:
## 
##   mtry  logLoss     AUC        prAUC      Accuracy   Kappa      F1       
##   2     0.32940082  0.9648945  0.9605005  0.9073438  0.8146875  0.9054030
##   3     0.21548497  0.9936597  0.9860055  0.9501563  0.9003125  0.9482608
##   4     0.15130195  0.9987207  0.9824974  0.9654687  0.9309375  0.9642947
##   5     0.11636163  0.9995269  0.9568789  0.9709375  0.9418750  0.9700589
##   6     0.09810270  0.9996938  0.9292311  0.9725000  0.9450000  0.9716944
##   7     0.08776087  0.9997627  0.8877371  0.9739062  0.9478125  0.9731940
##   8     0.08230689  0.9997446  0.8427196  0.9756250  0.9512500  0.9750030
##   Sensitivity  Specificity  Pos_Pred_Value  Neg_Pred_Value  Precision
##   0.8871875    0.9275000    0.9249028       0.8919291       0.9249028
##   0.9140625    0.9862500    0.9852632       0.9199784       0.9852632
##   0.9334375    0.9975000    0.9973251       0.9375371       0.9973251
##   0.9428125    0.9990625    0.9990238       0.9459864       0.9990238
##   0.9459375    0.9990625    0.9990260       0.9488463       0.9990260
##   0.9487500    0.9990625    0.9990249       0.9513363       0.9990249
##   0.9521875    0.9990625    0.9990280       0.9544558       0.9990280
##   Recall     Detection_Rate  Balanced_Accuracy
##   0.8871875  0.4435938       0.9073438        
##   0.9140625  0.4570312       0.9501563        
##   0.9334375  0.4667188       0.9654687        
##   0.9428125  0.4714062       0.9709375        
##   0.9459375  0.4729687       0.9725000        
##   0.9487500  0.4743750       0.9739062        
##   0.9521875  0.4760937       0.9756250        
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 8.
rf_model_pred = predict(rf_model, test_data)
confusionMatrix(rf_model_pred, test_data$y, positive = 'yes')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  756  65
##        yes  43  39
##                                           
##                Accuracy : 0.8804          
##                  95% CI : (0.8574, 0.9008)
##     No Information Rate : 0.8848          
##     P-Value [Acc > NIR] : 0.68420         
##                                           
##                   Kappa : 0.3537          
##                                           
##  Mcnemar's Test P-Value : 0.04331         
##                                           
##             Sensitivity : 0.37500         
##             Specificity : 0.94618         
##          Pos Pred Value : 0.47561         
##          Neg Pred Value : 0.92083         
##              Prevalence : 0.11517         
##          Detection Rate : 0.04319         
##    Detection Prevalence : 0.09081         
##       Balanced Accuracy : 0.66059         
##                                           
##        'Positive' Class : yes             
## 

This model has higher test accuracy but much lower sensitivity. It also overfit the yes class: cross-validated accuracy was about 98% while test accuracy is 88%, probably because up-sampling duplicates minority observations across CV folds and because the training data is balanced while the test data is not. The model does, however, have higher specificity and lower log loss, and is better at identifying true negatives.
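One possible mitigation (a sketch, not part of the original analysis) is to score class probabilities, e.g. with predict(rf_model, test_data, type = 'prob') in caret, and lower the 'yes' cutoff below 0.5 to recover sensitivity. A toy illustration of the trade-off:

```r
# Toy scores, not the fitted model: lowering the cutoff trades false positives for recall.
probs = c(0.90, 0.60, 0.45, 0.35, 0.20)   # predicted P(yes)
truth = c('yes', 'yes', 'yes', 'no', 'no')
sum(probs > 0.5 & truth == 'yes')  # 2 of 3 positives caught at the default 0.5 cutoff
sum(probs > 0.3 & truth == 'yes')  # 3 of 3 caught at 0.3, at the cost of one false positive
```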

Use random forest on downsampled data.

set.seed(100)

# set up tuning grid: try mtry values from 2 to 8
tunegrid = expand.grid(.mtry = 2:8)

rf_model_down = train(y ~., data = train_data_down, method = 'rf', tuneGrid = tunegrid, trControl = train_control)
rf_model_down
## Random Forest 
## 
## 834 samples
##  14 predictor
##   2 classes: 'no', 'yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 750, 751, 751, 750, 751, 751, ... 
## Resampling results across tuning parameters:
## 
##   mtry  logLoss    AUC        prAUC      Accuracy   Kappa      F1       
##   2     0.5050709  0.8596773  0.8315820  0.7829891  0.5660924  0.7900410
##   3     0.4793906  0.8736657  0.8440478  0.8057803  0.6117180  0.8047278
##   4     0.4619833  0.8793575  0.8517905  0.8201377  0.6403791  0.8141260
##   5     0.4540130  0.8779485  0.8491756  0.8225617  0.6451712  0.8150070
##   6     0.4428250  0.8823579  0.8556516  0.8189185  0.6377818  0.8105794
##   7     0.4400921  0.8810941  0.8522443  0.8224900  0.6449387  0.8141937
##   8     0.4388059  0.8796008  0.8507686  0.8201520  0.6402436  0.8127466
##   Sensitivity  Specificity  Pos_Pred_Value  Neg_Pred_Value  Precision
##   0.8129501    0.7533101    0.7705675       0.8005727       0.7705675
##   0.7984901    0.8135889    0.8146338       0.8021661       0.8146338
##   0.7913473    0.8493031    0.8427088       0.8051856       0.8427088
##   0.7840302    0.8613240    0.8531463       0.8015348       0.8531463
##   0.7790360    0.8587689    0.8482237       0.7977085       0.8482237
##   0.7837979    0.8612079    0.8502314       0.8021353       0.8502314
##   0.7838560    0.8563879    0.8473055       0.8006350       0.8473055
##   Recall     Detection_Rate  Balanced_Accuracy
##   0.8129501  0.4064400       0.7831301        
##   0.7984901  0.3992255       0.8060395        
##   0.7913473  0.3955967       0.8203252        
##   0.7840302  0.3919822       0.8226771        
##   0.7790360  0.3895726       0.8189024        
##   0.7837979  0.3919679       0.8225029        
##   0.7838560  0.3919822       0.8201220        
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 5.
rf_model_pred_down = predict(rf_model_down, test_data)
confusionMatrix(rf_model_pred_down, test_data$y, positive = 'yes')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  628  20
##        yes 171  84
##                                           
##                Accuracy : 0.7885          
##                  95% CI : (0.7604, 0.8147)
##     No Information Rate : 0.8848          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.3639          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.80769         
##             Specificity : 0.78598         
##          Pos Pred Value : 0.32941         
##          Neg Pred Value : 0.96914         
##              Prevalence : 0.11517         
##          Detection Rate : 0.09302         
##    Detection Prevalence : 0.28239         
##       Balanced Accuracy : 0.79684         
##                                           
##        'Positive' Class : yes             
## 

Now the model has higher sensitivity but lower overall accuracy. The random forest is less robust here than the GLM and is more affected by the sampling method applied to the training data.

Support Vector Machine

#model setup
set.seed(100)
svm_grid = expand.grid(C = seq(0, 2, length = 10)) # C = 0 is not a valid cost; it produces the NaN row below

# fit model
svm_model = train(y ~., data = train_data_up, method = 'svmLinear', tuneGrid = svm_grid, trControl = train_control)
svm_model
## Support Vector Machines with Linear Kernel 
## 
## 6400 samples
##   14 predictor
##    2 classes: 'no', 'yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 5760, 5760, 5760, 5760, 5760, 5760, ... 
## Resampling results across tuning parameters:
## 
##   C          logLoss    AUC        prAUC      Accuracy   Kappa      F1       
##   0.0000000        NaN        NaN        NaN        NaN        NaN        NaN
##   0.2222222  0.4066066  0.9061504  0.8961309  0.8201563  0.6403125  0.8235164
##   0.4444444  0.4065720  0.9060430  0.8960407  0.8212500  0.6425000  0.8242844
##   0.6666667  0.4064384  0.9060391  0.8959929  0.8212500  0.6425000  0.8241351
##   0.8888889  0.4065123  0.9060342  0.8960005  0.8212500  0.6425000  0.8244095
##   1.1111111  0.4067220  0.9060420  0.8960048  0.8217188  0.6434375  0.8247215
##   1.3333333  0.4066551  0.9060439  0.8960016  0.8207813  0.6415625  0.8238538
##   1.5555556  0.4067011  0.9060254  0.8959933  0.8218750  0.6437500  0.8249997
##   1.7777778  0.4067157  0.9060234  0.8959825  0.8214063  0.6428125  0.8243619
##   2.0000000  0.4067915  0.9059932  0.8959376  0.8218750  0.6437500  0.8247914
##   Sensitivity  Specificity  Pos_Pred_Value  Neg_Pred_Value  Precision
##         NaN          NaN          NaN             NaN             NaN
##   0.8387500    0.8015625    0.8091409       0.8325960       0.8091409
##   0.8384375    0.8040625    0.8108553       0.8328585       0.8108553
##   0.8375000    0.8050000    0.8114569       0.8321921       0.8114569
##   0.8390625    0.8034375    0.8105103       0.8332504       0.8105103
##   0.8387500    0.8046875    0.8113994       0.8332242       0.8113994
##   0.8378125    0.8037500    0.8106235       0.8321405       0.8106235
##   0.8396875    0.8040625    0.8110855       0.8339685       0.8110855
##   0.8381250    0.8046875    0.8113334       0.8327172       0.8113334
##   0.8384375    0.8053125    0.8118613       0.8330976       0.8118613
##   Recall     Detection_Rate  Balanced_Accuracy
##         NaN        NaN             NaN        
##   0.8387500  0.4193750       0.8201563        
##   0.8384375  0.4192187       0.8212500        
##   0.8375000  0.4187500       0.8212500        
##   0.8390625  0.4195313       0.8212500        
##   0.8387500  0.4193750       0.8217188        
##   0.8378125  0.4189063       0.8207813        
##   0.8396875  0.4198438       0.8218750        
##   0.8381250  0.4190625       0.8214063        
##   0.8384375  0.4192188       0.8218750        
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was C = 1.555556.
svm_model_pred = predict(svm_model, test_data)
confusionMatrix(svm_model_pred, test_data$y, positive = 'yes')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  659  23
##        yes 140  81
##                                           
##                Accuracy : 0.8195          
##                  95% CI : (0.7928, 0.8441)
##     No Information Rate : 0.8848          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.4053          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.7788          
##             Specificity : 0.8248          
##          Pos Pred Value : 0.3665          
##          Neg Pred Value : 0.9663          
##              Prevalence : 0.1152          
##          Detection Rate : 0.0897          
##    Detection Prevalence : 0.2447          
##       Balanced Accuracy : 0.8018          
##                                           
##        'Positive' Class : yes             
## 
#model setup
set.seed(100)
svm_grid = expand.grid(C = seq(0, 2, length = 10))

# fit model
svm_model_down = train(y ~., data = train_data_down, method = 'svmLinear', tuneGrid = svm_grid, trControl = train_control)
svm_model_down
## Support Vector Machines with Linear Kernel 
## 
## 834 samples
##  14 predictor
##   2 classes: 'no', 'yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 750, 751, 751, 750, 751, 751, ... 
## Resampling results across tuning parameters:
## 
##   C          logLoss    AUC        prAUC      Accuracy   Kappa      F1       
##   0.0000000        NaN        NaN        NaN        NaN        NaN        NaN
##   0.2222222  0.4619795  0.8774985  0.8467440  0.8093087  0.6185298  0.8162311
##   0.4444444  0.4636142  0.8776036  0.8466057  0.8056942  0.6112669  0.8114941
##   0.6666667  0.4651613  0.8776755  0.8466338  0.8056942  0.6112613  0.8119909
##   0.8888889  0.4617902  0.8772690  0.8459624  0.8020941  0.6040870  0.8082144
##   1.1111111  0.4649215  0.8775040  0.8461614  0.8044750  0.6088531  0.8103528
##   1.3333333  0.4656764  0.8772136  0.8459026  0.8056655  0.6112298  0.8109730
##   1.5555556  0.4634754  0.8771583  0.8458640  0.8056799  0.6112441  0.8119140
##   1.7777778  0.4641748  0.8771597  0.8459145  0.8044607  0.6088093  0.8103678
##   2.0000000  0.4632163  0.8766965  0.8454792  0.8056512  0.6111903  0.8113126
##   Sensitivity  Specificity  Pos_Pred_Value  Neg_Pred_Value  Precision
##         NaN          NaN          NaN             NaN             NaN
##   0.8463415    0.7721254    0.7901163       0.8354428       0.7901163
##   0.8366434    0.7745064    0.7895366       0.8273075       0.7895366
##   0.8390825    0.7720674    0.7882624       0.8288409       0.7882624
##   0.8343206    0.7696864    0.7852680       0.8242163       0.7852680
##   0.8367015    0.7720674    0.7872328       0.8272494       0.7872328
##   0.8343206    0.7768293    0.7902680       0.8254984       0.7902680
##   0.8390825    0.7720674    0.7879856       0.8288531       0.7879856
##   0.8367015    0.7720093    0.7871557       0.8270850       0.7871557
##   0.8367015    0.7743902    0.7888671       0.8274406       0.7888671
##   Recall     Detection_Rate  Balanced_Accuracy
##         NaN        NaN             NaN        
##   0.8463415  0.4231784       0.8092334        
##   0.8366434  0.4183735       0.8055749        
##   0.8390825  0.4195783       0.8055749        
##   0.8343206  0.4171830       0.8020035        
##   0.8367015  0.4183735       0.8043844        
##   0.8343206  0.4171830       0.8055749        
##   0.8390825  0.4195783       0.8055749        
##   0.8367015  0.4183735       0.8043554        
##   0.8367015  0.4183735       0.8055459        
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was C = 0.2222222.
svm_model_pred_down = predict(svm_model_down, test_data)
confusionMatrix(svm_model_pred_down, test_data$y, positive = 'yes')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  667  21
##        yes 132  83
##                                           
##                Accuracy : 0.8306          
##                  95% CI : (0.8045, 0.8545)
##     No Information Rate : 0.8848          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.4322          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.79808         
##             Specificity : 0.83479         
##          Pos Pred Value : 0.38605         
##          Neg Pred Value : 0.96948         
##              Prevalence : 0.11517         
##          Detection Rate : 0.09192         
##    Detection Prevalence : 0.23810         
##       Balanced Accuracy : 0.81644         
##                                           
##        'Positive' Class : yes             
## 

The SVM trained on downsampled data has statistics similar to the SVM trained on upsampled data, and both are very close to the GLM results.
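The similarity is easier to see side by side. A small summary table, with the values transcribed from the two confusion matrices above:

```r
# Test-set metrics copied from the confusion matrices printed above
svm_compare = data.frame(
  model       = c('svm_up', 'svm_down'),
  accuracy    = c(0.8195, 0.8306),
  sensitivity = c(0.7788, 0.7981),
  specificity = c(0.8248, 0.8348),
  kappa       = c(0.4053, 0.4322)
)
svm_compare
```

The downsampled fit is marginally better on every metric, but the gaps are well within the reported 95% confidence intervals for accuracy.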

Compare models

Comparing model metrics

Models from upsampled data

models_upsampled = list(glm_up = glm_model_up, rf_up = rf_model, svm_up = svm_model)
resampled_upsampled = resamples(models_upsampled)
bwplot(resampled_upsampled, metric = c('Sensitivity','Specificity','Accuracy','AUC'), main = 'Resampled metrics on upsampled training data')

The resampling results show that the random forest model has the highest accuracy metrics, and its confidence intervals do not overlap with those of the other models. However, as its confusion matrix on the test data showed, the random forest model overfits: it has low sensitivity and fails to predict the "yes" response well.

Models from downsampled data

models_downsampled = list(glm_down = glm_model, rf_down = rf_model_down, svm_down = svm_model_down)
resampled_downsampled = resamples(models_downsampled)
bwplot(resampled_downsampled, metric = c('Accuracy','AUC','Sensitivity','Specificity'), main = 'Resampled metrics on downsampled training data')

For the downsampled training data, the models have similar metric values with overlapping confidence intervals.

Compare SVM and GLM

compare_models(glm_model, svm_model_down)
## 
##  One Sample t-test
## 
## data:  x
## t = -1.875, df = 9, p-value = 0.09355
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  -0.023767428  0.002224112
## sample estimates:
##   mean of x 
## -0.01077166

Best model

There is no significant difference in the resampling results of the GLM and SVM models (p = 0.094 above). It is therefore better to pick the GLM, since it is more interpretable and more scalable than the SVM.

When the resampling results of the random forest model are compared with those of the other two models, RF clearly has the highest accuracy metrics. The problem lies in the data it was trained on: despite its high resampled accuracy, it fails to identify true positives on the test set.

One would choose the GLM over RF for its simplicity and accuracy. Its test predictions are acceptable even though it predicts "yes" more often than it should. The downside of the GLM is the marketing-campaign cost incurred on false-positive clients.
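That false-positive cost can be made concrete with a back-of-the-envelope calculation. A sketch, assuming a hypothetical per-contact cost of 5 currency units (not a figure from the data) and using the downsampled SVM's false-positive count from the confusion matrix above as an illustration:

```r
# Hypothetical per-contact campaign cost (assumption, not from the data)
cost_per_contact = 5
# False positives: clients predicted 'yes' who actually said 'no'
# (132 in the downsampled SVM's test confusion matrix above)
false_positives = 132
# Budget wasted contacting clients who will not subscribe
wasted_cost = false_positives * cost_per_contact
wasted_cost
## [1] 660
```

A model with fewer false positives trades some of this wasted spend for a risk of missing true subscribers, so the right balance depends on the cost per contact relative to the value of a conversion.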

Identify Important variables

plot(varImp(glm_model_up), top = 10, main = 'Top 10 important variables in the GLM')

plot(varImp(rf_model), top = 10, main = 'Top 10 important variables in the Random Forest Model')

In both models, duration and poutcome appear among the top 10 important variables.

IV. Things to do/improve